```mermaid
%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '14px'}}}%%
timeline
    title 2020–2025 AI Milestones — ChatGPT Moments, Generative AI and Agentic AI
    2020 : GPT-3 — 175 billion parameters, few-shot learning
         : Vision Transformer (ViT) — applies transformers to image classification
         : AlphaFold2 wins CASP14 — solves protein folding
    2021 : DALL·E and CLIP — text-to-image generation and vision-language alignment
         : Codex and GitHub Copilot — AI-powered code generation
         : AlphaFold Protein Structure Database launched — expanded to 200M+ structures by 2022
    2022 : ChatGPT launched — 100M users in two months
         : InstructGPT and RLHF — aligning LLMs with human preferences
         : Chinchilla — optimal compute-data scaling laws
         : DALL·E 2 and Stable Diffusion — photorealistic image generation goes mainstream
         : Chain-of-thought prompting — teaching LLMs to reason step by step
    2023 : GPT-4 — multimodal reasoning across text and images
         : Segment Anything Model (SAM) — foundation model for computer vision
         : LLaMA and Llama 2 — Meta open-weights revolution
         : Midjourney V5 and DALL·E 3 — generative art reaches new heights
         : Microsoft 365 Copilot — AI embedded in enterprise productivity
    2024 : GPT-4o — omni-modal real-time AI assistant
         : AlphaFold3 — predicts structures of protein-DNA-RNA complexes
         : On-device AI — language models on smartphones and laptops
         : EU AI Act enters force — first comprehensive AI regulation
         : Nobel Prize in Chemistry for AlphaFold creators
    2025 : AI agents — planning, tool use, and autonomous multi-step actions
         : DeepSeek and open reasoning models challenge frontier labs
         : Text-to-video generation — Sora and the next creative frontier
         : AI governance frameworks expand worldwide
```
2020–2025 AI Milestones
ChatGPT Moments, Generative AI & Agentic AI — how large language models, diffusion models, and autonomous agents reshaped civilization

Introduction
The first half of the 2020s will be remembered as the era when artificial intelligence left the lab and entered everyday life. In just five years, AI progressed from a powerful but largely invisible technology to a cultural force that reshaped how billions of people work, create, learn, and communicate.
The period opened with a dramatic scaling experiment: in June 2020, OpenAI released GPT-3, a language model with 175 billion parameters that could write essays, code, and poetry from a simple text prompt. The era of few-shot learning had arrived — but the real earthquake came two years later. On November 30, 2022, OpenAI launched ChatGPT, a conversational interface to its large language models. It reached 100 million users in two months — the fastest consumer adoption in history — and ignited a generative AI revolution that swept through every industry on Earth.
Parallel breakthroughs in image generation transformed creativity itself. DALL·E (2021), Stable Diffusion (2022), and Midjourney became cultural phenomena, enabling anyone to generate photorealistic images from text descriptions. In science, AlphaFold2 solved the 50-year-old protein folding problem, predicting the structures of virtually all known proteins and winning its creators the 2024 Nobel Prize in Chemistry.
The march continued with GPT-4 (2023), which demonstrated multimodal reasoning across text and images; LLaMA and its successors, which democratized large language models through open weights; and the Segment Anything Model, which did for image segmentation what GPT-3 had done for language. By 2024, multimodal assistants could see, hear, and speak; on-device AI brought language models to smartphones and laptops; and governments worldwide began crafting AI governance frameworks, including the landmark EU AI Act.
By 2025, the frontier had shifted to agentic AI — systems that don’t just answer questions but plan, reason, use tools, and take autonomous actions. The AI agent era had begun, built atop everything that came before: transformers, scaling laws, RLHF alignment, and the hard-won lessons of deploying AI at planetary scale.
This article traces the defining milestones of 2020–2025 — from GPT-3’s few-shot learning revelation, through the ChatGPT moment that changed everything, to the rise of generative AI and the dawn of agentic systems.
Timeline of Key Milestones
GPT-3: The Scale Revolution (2020)
In June 2020, OpenAI released GPT-3 — a language model with 175 billion parameters trained on a massive corpus of internet text. GPT-3 demonstrated a stunning capability that its predecessors only hinted at: few-shot learning. Given just a few examples in a text prompt, it could translate languages, write code, compose poetry, generate business emails, and answer factual questions — all without any task-specific fine-tuning.
The jump from GPT-2’s 1.5 billion parameters to GPT-3’s 175 billion was not merely quantitative. It crossed a threshold where the model exhibited emergent capabilities — behaviors that appeared only at sufficient scale. GPT-3 could perform arithmetic, write SQL queries, and even generate functional code, despite never being explicitly trained for these tasks.
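Few-shot learning can be made concrete: a prompt is just worked examples concatenated ahead of a new query, and the model infers the task from the pattern. A minimal sketch (the translation pairs are invented for illustration):

```python
# A few-shot prompt in the GPT-3 style: worked examples followed by a new
# query. The model infers the task (English -> French) purely from the
# pattern; no fine-tuning is involved.
examples = [
    ("cheese", "fromage"),
    ("house", "maison"),
    ("to run", "courir"),
]

prompt = "Translate English to French.\n\n"
for english, french in examples:
    prompt += f"English: {english}\nFrench: {french}\n\n"
prompt += "English: good morning\nFrench:"  # the model completes this line

print(prompt)
```

The same mechanism covers translation, code generation, or Q&A simply by swapping the examples, which is what made the single pretrained model so versatile.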
| Aspect | Details |
|---|---|
| Released | June 2020 |
| Developer | OpenAI |
| Parameters | 175 billion (~117× larger than GPT-2’s 1.5 billion) |
| Training data | 570 GB of filtered text (Common Crawl, WebText2, Books, Wikipedia) |
| Training cost | Estimated $4.6 million in compute |
| Key capability | Few-shot learning — task performance from prompt examples alone |
| Access model | API-only (no public weights released) |
| Significance | Demonstrated that scale alone could unlock emergent capabilities |
“One of the things that was most surprising about GPT-3 is that it can do things it was never trained to do.” — Sam Altman, CEO of OpenAI
GPT-3 also ignited a debate about the nature of intelligence. Critics argued it was a sophisticated pattern matcher with no genuine understanding; proponents countered that its ability to generalize across tasks suggested something beyond mere memorization. Regardless of the philosophical disputes, GPT-3 proved that massive scale and simple next-token prediction could produce remarkably versatile systems — and set the stage for the ChatGPT moment that would come two years later.
```mermaid
graph LR
    A["GPT-1<br/>117M params<br/>(2018)"] --> B["GPT-2<br/>1.5B params<br/>(2019)"]
    B --> C["GPT-3<br/>175B params<br/>(2020)"]
    C --> D["InstructGPT<br/>RLHF-aligned<br/>(2022)"]
    D --> E["ChatGPT<br/>Conversational UI<br/>(Nov 2022)"]
    E --> F["GPT-4<br/>Multimodal<br/>(2023)"]
    style A fill:#3498db,color:#fff,stroke:#333
    style B fill:#2980b9,color:#fff,stroke:#333
    style C fill:#e74c3c,color:#fff,stroke:#333
    style D fill:#27ae60,color:#fff,stroke:#333
    style E fill:#f39c12,color:#fff,stroke:#333
    style F fill:#8e44ad,color:#fff,stroke:#333
```
Vision Transformer (ViT): Transformers Conquer Vision (2020)
In October 2020, researchers at Google Brain published “An Image is Worth 16×16 Words”, introducing the Vision Transformer (ViT). The paper demonstrated that a standard transformer architecture — originally designed for language — could achieve state-of-the-art image classification when applied directly to sequences of image patches, without any convolutional layers.
ViT divided each image into fixed-size patches (typically 16×16 pixels), flattened them into vectors, and processed the resulting sequence using a standard transformer encoder with self-attention. When pretrained on large datasets (JFT-300M), ViT outperformed the best convolutional networks on ImageNet while requiring substantially less computational budget to train.
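The patch-embedding step described above is simple enough to sketch in a few lines of NumPy. The image size here is a toy stand-in for ViT's usual 224×224 input, and the learned linear projection that maps each flattened patch to the model width is omitted:

```python
import numpy as np

# Split an image into non-overlapping 16x16 patches and flatten each into a
# vector -- the "tokens" a Vision Transformer feeds to a standard encoder.
def patchify(image: np.ndarray, patch: int = 16) -> np.ndarray:
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0
    # (H/p, p, W/p, p, C) -> (H/p, W/p, p, p, C) -> (N, p*p*C)
    patches = image.reshape(h // patch, patch, w // patch, patch, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(-1, patch * patch * c)

image = np.random.rand(64, 64, 3)   # toy 64x64 RGB image
tokens = patchify(image)
print(tokens.shape)                 # (16, 768): 16 patches, 768 values each
```

After this step the sequence of patch tokens is processed exactly like a sequence of word tokens, which is the whole point of the paper's title.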
| Aspect | Details |
|---|---|
| Published | October 2020 |
| Authors | Alexey Dosovitskiy et al. (Google Brain) |
| Architecture | Standard transformer encoder applied to 16×16 image patches |
| Key result | Surpassed state-of-the-art CNNs on ImageNet when pretrained at scale |
| Significance | Unified vision and language under a single transformer architecture |
| Follow-ups | DeiT, Swin Transformer, BEiT, DINO — transformer-based vision became dominant |
“An image is worth 16×16 words” — the paper’s title captured the elegant simplicity of treating image patches as tokens.
ViT’s success had far-reaching consequences. It demonstrated that the transformer architecture was not specific to language but was a general-purpose sequence processor. This insight accelerated research into multimodal models — systems that could process text, images, audio, and video within a unified transformer framework — and paved the way for CLIP, DALL·E, and the multimodal assistants that emerged in 2023–2024.
AlphaFold2: Solving Protein Folding (2020–2021)
In November 2020, DeepMind’s AlphaFold2 achieved a breakthrough that scientists had pursued for 50 years: accurately predicting the three-dimensional structure of proteins from their amino acid sequences. At the CASP14 (Critical Assessment of protein Structure Prediction) competition, AlphaFold2 achieved a median GDT score of 92.4 out of 100 — a level of accuracy comparable to experimental techniques like X-ray crystallography — and made the best prediction for 88 out of 97 targets.
In July 2021, DeepMind and EMBL-EBI launched the AlphaFold Protein Structure Database, initially containing predictions for the human proteome and 20 model organisms. By July 2022, the database expanded to cover over 200 million protein structures — virtually every known protein across all life forms.
| Aspect | Details |
|---|---|
| CASP14 results | November 2020 |
| Developer | DeepMind (Google / Alphabet) |
| Architecture | Evoformer + structure module, end-to-end differentiable |
| CASP14 median GDT | 92.4 / 100 (comparable to experimental methods) |
| Database launched | July 2021 — expanded to 200M+ structures by July 2022 |
| Nobel Prize | 2024 Nobel Prize in Chemistry — Demis Hassabis and John Jumper |
| Significance | Solved a 50-year grand challenge in biology |
Nobel laureate Venki Ramakrishnan called AlphaFold2 “a stunning advance on the protein folding problem… It has occurred decades before many people in the field would have predicted.”
AlphaFold2’s impact on biology and medicine has been transformational. Researchers worldwide use it to understand disease mechanisms, design new drugs, and engineer novel proteins. In 2024, AlphaFold’s creators — Demis Hassabis and John Jumper — shared the Nobel Prize in Chemistry for their work on protein structure prediction, alongside David Baker for computational protein design. AlphaFold3 (2024) extended the approach to predict structures of protein complexes with DNA, RNA, and other molecules.
```mermaid
graph TD
    A["Amino Acid<br/>Sequence Input"] --> B["Multiple Sequence<br/>Alignment (MSA)"]
    B --> C["Evoformer<br/>(Attention-based)"]
    C --> D["Structure<br/>Prediction Module"]
    D --> E["3D Protein<br/>Structure Output"]
    E --> F["CASP14: 92.4 GDT<br/>(Comparable to<br/>X-ray crystallography)"]
    F --> G["200M+ Protein<br/>Structures in Database"]
    G --> H["2024 Nobel Prize<br/>in Chemistry"]
    style A fill:#3498db,color:#fff,stroke:#333
    style B fill:#2980b9,color:#fff,stroke:#333
    style C fill:#e74c3c,color:#fff,stroke:#333
    style D fill:#27ae60,color:#fff,stroke:#333
    style E fill:#f39c12,color:#fff,stroke:#333
    style F fill:#8e44ad,color:#fff,stroke:#333
    style G fill:#1a5276,color:#fff,stroke:#333
    style H fill:#e67e22,color:#fff,stroke:#333
```
DALL·E and CLIP: Vision Meets Language (2021)
In January 2021, OpenAI unveiled two groundbreaking systems that bridged the gap between vision and language: DALL·E, a model that generated images from text descriptions, and CLIP (Contrastive Language–Image Pre-training), which learned to connect images and text in a shared embedding space.
CLIP was trained on 400 million image-text pairs scraped from the internet, learning to match images with their corresponding text descriptions using contrastive learning. The result was a visual system that could classify images using natural language descriptions — including categories it had never seen during training (zero-shot classification).
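Zero-shot classification with CLIP reduces to a nearest-neighbor search in the shared embedding space: embed the image, embed one text prompt per candidate class, and pick the most similar. A minimal NumPy sketch, with random vectors standing in for the real image and text encoders:

```python
import numpy as np

# CLIP-style zero-shot classification. The random "embeddings" below are
# stand-ins for the outputs of CLIP's image and text towers; with the real
# model, the dog photo would land nearest the dog prompt.
rng = np.random.default_rng(0)

def normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

image_emb = normalize(rng.standard_normal(512))          # one image
class_prompts = ["a photo of a dog", "a photo of a cat", "a photo of a car"]
text_embs = normalize(rng.standard_normal((3, 512)))     # one per class

# Cosine similarity is a dot product of unit vectors; argmax picks the class.
similarities = text_embs @ image_emb
predicted = class_prompts[int(np.argmax(similarities))]
print(predicted)
```

Because the classes are expressed as text, swapping in new categories requires nothing but new prompts, which is what "zero-shot" means here.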
DALL·E (a portmanteau of Salvador Dalí and WALL·E) was a 12-billion parameter version of GPT-3, modified to generate images from text prompts. It could produce images of “an armchair in the shape of an avocado” or “a snail made of harp strings” — demonstrating a compositional understanding of language and visual concepts that stunned researchers.
| Aspect | Details |
|---|---|
| Announced | January 2021 |
| Developer | OpenAI |
| CLIP | Contrastive learning on 400M image-text pairs; zero-shot image classification |
| DALL·E | 12B parameter GPT-3 variant; text-to-image generation |
| Key insight | Vision and language can be learned jointly in a shared representation space |
| Impact | Foundation for DALL·E 2, Stable Diffusion, Midjourney, and multimodal AI |
“CLIP effectively generalizes to virtually any visual classification task, merely by describing the classes in natural language.” — OpenAI Research Blog
CLIP’s vision-language alignment became the backbone of the generative image revolution that followed. Stable Diffusion used CLIP’s text encoder to steer its diffusion process; DALL·E 2 and DALL·E 3 built upon CLIP’s multimodal understanding. The insight that vision and language could be unified in a single representation space became one of the defining ideas of the era.
Codex and GitHub Copilot: AI Writes Code (2021)
In June 2021, GitHub and OpenAI launched GitHub Copilot as a technical preview — an AI pair-programming tool that suggested code completions directly inside the developer’s editor. In August 2021, OpenAI released Codex, the model powering Copilot: a GPT-3 descendant fine-tuned on publicly available code from GitHub that could translate natural language instructions into working code across dozens of programming languages.
Copilot was among the first AI systems to be adopted at massive scale in professional workflows. Within two years, it had millions of users and was generating a significant fraction of the code written by its adopters. Developers described it as transformative — not replacing programmers but dramatically accelerating their work.
| Aspect | Details |
|---|---|
| Codex released | August 2021 |
| GitHub Copilot launched | June 2021 (technical preview), general availability June 2022 |
| Developer | OpenAI (Codex), GitHub / Microsoft (Copilot) |
| Based on | GPT-3, fine-tuned on public GitHub code |
| Languages | Python, JavaScript, TypeScript, Go, Ruby, and dozens more |
| Significance | First widely adopted AI tool for professional software development |
| Impact | Millions of developers; GitHub’s controlled study reported tasks completed up to 55% faster |
“Copilot writes the boring code so I can focus on the interesting code.” — common developer sentiment, 2022
GitHub Copilot demonstrated that LLMs could serve as practical, everyday productivity tools — not just research curiosities. Its success paved the way for Microsoft 365 Copilot (2023), which brought the same paradigm to documents, spreadsheets, presentations, and email, embedding AI assistance into the core of enterprise productivity.
ChatGPT: The Moment Everything Changed (November 2022)
On November 30, 2022, OpenAI released ChatGPT — a conversational interface to its GPT-3.5 language model, fine-tuned using Reinforcement Learning from Human Feedback (RLHF) to be helpful, harmless, and honest. Within five days, it had one million users. Within two months, it reached 100 million monthly active users — making it the fastest-growing consumer application in history.
ChatGPT did not introduce fundamentally new AI capabilities. GPT-3 already existed; RLHF had been published in the InstructGPT paper earlier in 2022. What ChatGPT achieved was something more profound: it made AI accessible to everyone. The conversational interface — simple, free, and requiring no technical expertise — invited hundreds of millions of people to interact directly with a large language model for the first time.
| Aspect | Details |
|---|---|
| Launched | November 30, 2022 |
| Developer | OpenAI |
| Underlying model | GPT-3.5, fine-tuned with RLHF |
| 1 million users | Within 5 days |
| 100 million users | Within 2 months (fastest-growing consumer app ever) |
| ChatGPT Plus | Launched February 2023 ($20/month) |
| Key innovation | Conversational UI + RLHF alignment made LLMs accessible to everyone |
| Impact | Ignited the generative AI boom; triggered industry-wide AI arms race |
Kevin Roose of The New York Times called ChatGPT “the best artificial intelligence chatbot ever released to the general public.”
The ripple effects were immediate and seismic. Google declared a “code red” and rushed to launch its own chatbot, Bard (later Gemini). Microsoft invested $10 billion in OpenAI and integrated GPT-4 into Bing. Every major tech company pivoted to generative AI. Startups raised billions. Universities debated how to handle AI-generated assignments. And for the first time in history, hundreds of millions of ordinary people experienced the power — and the limitations — of conversational AI directly.
```mermaid
graph TD
    A["GPT-3<br/>(June 2020)"] --> B["InstructGPT + RLHF<br/>(March 2022)"]
    B --> C["ChatGPT<br/>(Nov 30, 2022)"]
    C --> D["100M Users<br/>in 2 Months"]
    D --> E["Google Bard<br/>Microsoft Bing Chat<br/>Industry AI Arms Race"]
    C --> F["ChatGPT Plus<br/>(Feb 2023)"]
    F --> G["GPT-4 Integration<br/>(March 2023)"]
    G --> H["Plugins, Browsing,<br/>Code Interpreter"]
    style A fill:#3498db,color:#fff,stroke:#333
    style B fill:#27ae60,color:#fff,stroke:#333
    style C fill:#e74c3c,color:#fff,stroke:#333
    style D fill:#f39c12,color:#fff,stroke:#333
    style E fill:#8e44ad,color:#fff,stroke:#333
    style F fill:#1a5276,color:#fff,stroke:#333
    style G fill:#2980b9,color:#fff,stroke:#333
    style H fill:#e67e22,color:#fff,stroke:#333
```
InstructGPT and RLHF: Aligning AI with Human Values (2022)
In March 2022, OpenAI published the InstructGPT paper, describing how they used Reinforcement Learning from Human Feedback (RLHF) to align language models with human intentions. The key insight was simple but powerful: instead of training only on next-token prediction, you could fine-tune a model to follow instructions, be truthful, and avoid harmful outputs — by using human preferences as a training signal.
The process involved three steps: (1) collect human demonstrations of ideal responses and fine-tune the model via supervised learning; (2) have human raters rank multiple model outputs to train a reward model; (3) use the reward model to fine-tune the language model via Proximal Policy Optimization (PPO).
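Step 2 can be sketched concretely: the reward model is trained so that the response humans preferred scores higher than the rejected one, typically with a pairwise (Bradley-Terry style) loss. A minimal NumPy illustration, with scalar placeholders standing in for reward-model outputs:

```python
import numpy as np

# Pairwise preference loss used to train RLHF reward models:
#   loss = -log sigmoid(r_chosen - r_rejected)
# Minimizing it pushes the reward for the human-preferred response above
# the reward for the rejected one.
def preference_loss(r_chosen: float, r_rejected: float) -> float:
    return -np.log(1.0 / (1.0 + np.exp(-(r_chosen - r_rejected))))

# The loss shrinks as the margin between chosen and rejected grows ...
assert preference_loss(2.0, 0.0) < preference_loss(0.5, 0.0)
# ... and is large when the model ranks the pair the wrong way round.
assert preference_loss(-1.0, 1.0) > preference_loss(1.0, -1.0)
print(round(preference_loss(1.0, 0.0), 4))
```

The trained reward model then supplies the training signal for the PPO stage in step 3, replacing per-example human labels with a learned proxy for human preference.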
| Aspect | Details |
|---|---|
| Published | March 2022 |
| Developer | OpenAI |
| Technique | RLHF — Reinforcement Learning from Human Feedback |
| Steps | Supervised fine-tuning → Reward model training → PPO optimization |
| Applied to | GPT-3 models (1.3B, 6B, and 175B); the approach later shaped GPT-3.5 and GPT-4 |
| Key result | A 1.3B InstructGPT was preferred over the 175B GPT-3 by human raters |
| Significance | Established RLHF as the standard method for aligning LLMs |
“Our 1.3 billion parameter InstructGPT model outputs are preferred to outputs of the 175 billion parameter GPT-3, despite having 100× fewer parameters.” — Ouyang et al. (2022)
RLHF became the default alignment technique across the industry. Google used it for Gemini, Anthropic refined it into RLAIF (RL from AI Feedback) and Constitutional AI, and Meta applied it to Llama. The InstructGPT paper established that alignment is not just about making models larger — it’s about making them better at following human intent — a lesson that underpins the entire modern AI stack.
Chinchilla and Scaling Laws: How to Train Efficiently (2022)
In March 2022, DeepMind published “Training Compute-Optimal Large Language Models”, introducing the Chinchilla model. The paper challenged the prevailing wisdom on scaling — demonstrating that most large language models were significantly undertrained relative to their size.
The key finding: for a given compute budget, the optimal strategy is to scale model size and training data in equal proportion. A 70-billion parameter model trained on 1.4 trillion tokens (Chinchilla) outperformed the 280-billion parameter Gopher trained on 300 billion tokens — despite having a quarter as many parameters.
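The arithmetic behind the finding is compact. Using the common approximation that training cost is about 6·N·D FLOPs (N parameters, D tokens) and the Chinchilla rule of thumb D ≈ 20·N, the compute-optimal sizes for a given budget follow directly; a sketch:

```python
import math

# Chinchilla's rule of thumb: training FLOPs C ~ 6*N*D (N params, D tokens)
# and compute-optimal training uses roughly D ~ 20*N. Substituting gives
# C ~ 120*N^2, so N = sqrt(C/120) and D = 20*N.
def compute_optimal(flops: float) -> tuple[float, float]:
    params = math.sqrt(flops / 120.0)
    tokens = 20.0 * params
    return params, tokens

# Sanity check against Chinchilla itself: 70B params on 1.4T tokens.
flops = 6 * 70e9 * 1.4e12   # ~5.9e23 FLOPs
n, d = compute_optimal(flops)
print(f"~{n / 1e9:.0f}B params, ~{d / 1e12:.1f}T tokens")
```

Run against Chinchilla's own budget, the formula recovers its 70B/1.4T configuration, which is exactly why Gopher (280B on only 300B tokens) was undertrained for its size.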
| Aspect | Details |
|---|---|
| Published | March 2022 |
| Developer | DeepMind |
| Model | Chinchilla (70B parameters, 1.4T training tokens) |
| Key finding | Models should be trained on ~20 tokens per parameter for compute-optimality |
| Result | Chinchilla (70B) outperformed Gopher (280B) on most benchmarks |
| Impact | Shifted industry focus from parameter count to data quality and quantity |
“For every doubling of model size, the number of training tokens should also be doubled.” — Hoffmann et al. (2022)
Chinchilla’s scaling laws reshaped how every lab trained large models. Meta’s LLaMA (2023) explicitly followed Chinchilla-optimal ratios, training a 65B parameter model on 1.4 trillion tokens. The lesson propagated industry-wide: more data, not just more parameters, was the path to better models.
DALL·E 2 and Stable Diffusion: The Generative Image Revolution (2022)
In April 2022, OpenAI released DALL·E 2, which used a diffusion-based approach (replacing the original DALL·E’s autoregressive method) to generate photorealistic images from text prompts at much higher resolution and fidelity. Then in August 2022, Stability AI released Stable Diffusion — an open-source latent diffusion model that could run on consumer GPUs.
Stable Diffusion democratized image generation. Unlike DALL·E 2, which was available only through an API, Stable Diffusion’s weights and code were publicly available. Anyone with a modest GPU could generate, modify, and fine-tune their own image generation models. Within weeks, a vibrant ecosystem of tools, extensions, and communities emerged.
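The forward (noising) process at the heart of these diffusion models is only a few lines: x_t = √ᾱ_t·x₀ + √(1−ᾱ_t)·ε, and generation learns to reverse it. A toy NumPy sketch with a linear beta schedule; note that Stable Diffusion applies this in a VAE's latent space rather than pixel space:

```python
import numpy as np

# Forward diffusion q(x_t | x_0): progressively mix a clean signal with
# Gaussian noise according to a schedule. alphas_bar[t] is the fraction of
# the original signal's variance that survives to step t.
rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)          # linear noise schedule
alphas_bar = np.cumprod(1.0 - betas)

def add_noise(x0: np.ndarray, t: int) -> np.ndarray:
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps

x0 = np.ones(8)                             # toy 1-D "image"
slightly_noisy = add_noise(x0, t=10)        # early step: barely perturbed
nearly_noise = add_noise(x0, t=T - 1)       # final step: almost pure noise
print(alphas_bar[10], alphas_bar[T - 1])
```

Training teaches a U-Net to predict ε from x_t (with CLIP's text embedding as conditioning); sampling then runs the chain backwards from pure noise to an image.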
| Aspect | Details |
|---|---|
| DALL·E 2 released | April 2022 |
| Stable Diffusion released | August 2022 |
| SD architecture | Latent Diffusion Model (VAE + U-Net + CLIP text encoder) |
| SD parameters | ~860M (U-Net) + 123M (text encoder) |
| SD training cost | ~$600,000 on 256 NVIDIA A100 GPUs |
| Key innovation | Diffusion in latent space — high quality at lower compute cost |
| Impact | Open-source image generation; ran on consumer hardware |
“Stable Diffusion marked the moment when AI image generation left the laboratory and entered the hands of millions.” — MIT Technology Review
The generative image revolution raised profound questions about copyright, consent, and the future of creative work. Artists protested that their styles were being replicated without permission. Legal battles ensued — Getty Images sued Stability AI; artists sued both Stability AI and Midjourney. Meanwhile, the technology continued to advance: Midjourney V5 (2023) produced images indistinguishable from photographs, and DALL·E 3 (October 2023) was integrated directly into ChatGPT.
```mermaid
graph LR
    A["DALL·E 1<br/>(Jan 2021)<br/>Autoregressive"] --> B["DALL·E 2<br/>(Apr 2022)<br/>Diffusion"]
    C["Latent Diffusion<br/>(LMU Munich, 2021)"] --> D["Stable Diffusion<br/>(Aug 2022)<br/>Open Source"]
    B --> E["DALL·E 3<br/>(Oct 2023)<br/>ChatGPT Integration"]
    D --> F["SD XL · SD 3<br/>Community Ecosystem"]
    G["Midjourney<br/>(2022–2023)"] --> H["Generative AI<br/>Cultural Phenomenon"]
    E --> H
    F --> H
    style A fill:#3498db,color:#fff,stroke:#333
    style B fill:#2980b9,color:#fff,stroke:#333
    style C fill:#27ae60,color:#fff,stroke:#333
    style D fill:#e74c3c,color:#fff,stroke:#333
    style E fill:#8e44ad,color:#fff,stroke:#333
    style F fill:#f39c12,color:#fff,stroke:#333
    style G fill:#1a5276,color:#fff,stroke:#333
    style H fill:#e67e22,color:#fff,stroke:#333
```
Chain-of-Thought Prompting: Teaching LLMs to Reason (2022)
In January 2022, Jason Wei and colleagues at Google Brain published a paper demonstrating that providing worked examples with explicit intermediate reasoning in the prompt could dramatically improve LLM performance on math, logic, and multi-step reasoning tasks. The technique became known as chain-of-thought (CoT) prompting; the zero-shot variant — simply appending “Let’s think step by step” — followed later that year (Kojima et al., 2022).
The insight was deceptively simple: LLMs trained on next-token prediction had learned to reason, but only when the reasoning process was made explicit in the output. By prompting the model to “show its work,” performance on arithmetic, commonsense reasoning, and symbolic manipulation improved dramatically — on the GSM8K math benchmark, accuracy roughly tripled.
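A chain-of-thought prompt is ordinary few-shot prompting with the reasoning written out. A sketch using the canonical tennis-ball example from the paper (the second question is invented for illustration):

```python
# Few-shot chain-of-thought prompt: the worked example spells out its
# intermediate reasoning, so the model imitates "showing its work" on the
# new question instead of jumping straight to an answer.
cot_example = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is "
    "6 tennis balls. 5 + 6 = 11. The answer is 11.\n\n"
)
new_question = (
    "Q: A baker makes 4 trays of 12 rolls and sells 30 of them. "
    "How many rolls are left?\n"
    "A:"
)
prompt = cot_example + new_question
print(prompt)
```

Remove the intermediate sentences from the example answer and the model is far more likely to guess a final number directly, and to guess wrong; the reasoning trace is the entire trick.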
| Aspect | Details |
|---|---|
| Published | January 2022 (Wei et al.) |
| Developers | Google Brain |
| Technique | Include reasoning steps in few-shot examples or instruct “think step by step” |
| Key result | GSM8K math accuracy improved from ~18% to ~57% on PaLM 540B |
| Variants | Zero-shot CoT, self-consistency, tree-of-thought, chain-of-thought distillation |
| Impact | Foundation for reasoning models (o1, o3) and agentic workflows |
“Chain-of-thought prompting allows large language models to decompose multi-step problems into intermediate steps, significantly improving reasoning.” — Wei et al. (2022)
Chain-of-thought prompting was more than a prompt engineering trick. It revealed that reasoning is an emergent capability of scale — appearing only in sufficiently large models — and it laid the conceptual foundation for the reasoning models that emerged in 2024–2025, including OpenAI’s o1 and o3 series, which internalized chain-of-thought as a core inference mechanism.
GPT-4: Multimodal Intelligence (March 2023)
On March 14, 2023, OpenAI released GPT-4 — its most capable large language model at the time. GPT-4 was the first commercially deployed model to accept both text and image inputs (multimodal), producing text outputs that demonstrated markedly improved reasoning, factuality, and instruction-following compared to its predecessors.
GPT-4 scored in the 90th percentile on a simulated bar exam, in the 99th percentile on the GRE verbal section, and showed substantial improvements on coding benchmarks. Its multimodal capability allowed users to upload images and ask questions about them — a preview of the visual reasoning that would become central to AI assistants.
| Aspect | Details |
|---|---|
| Released | March 14, 2023 |
| Developer | OpenAI |
| Modalities | Text input + image input → text output |
| Bar exam performance | 90th percentile (vs. GPT-3.5’s 10th percentile) |
| GRE verbal | 99th percentile |
| Context window | 8K and 32K token variants |
| Follow-ups | GPT-4 Turbo (Nov 2023), GPT-4o (May 2024), GPT-4o mini (Jul 2024) |
| Significance | First commercially deployed multimodal LLM at scale |
“GPT-4 is more reliable, creative, and able to handle much more nuanced instructions than GPT-3.5.” — OpenAI Technical Report
GPT-4 was quickly integrated into ChatGPT Plus, Bing Chat (now Microsoft Copilot), and hundreds of enterprise applications. It validated the bet that scaling plus RLHF alignment could produce models with genuine utility across law, medicine, coding, education, and creative work.
Segment Anything Model (SAM): Foundation Model for Vision (April 2023)
In April 2023, Meta AI released the Segment Anything Model (SAM) alongside the SA-1B dataset — the largest segmentation dataset ever created, containing over 1 billion masks on 11 million images. SAM could segment any object in any image, prompted by a point, a bounding box, or a rough mask — without being trained on that specific object category. (The paper also explored text prompts, though that capability was not part of the public release.)
SAM did for computer vision what GPT-3 did for language: it demonstrated that a single, large foundation model could generalize across virtually any visual segmentation task. It was a zero-shot, promptable vision model — a paradigm shift from task-specific models that required custom training for each new class of objects.
| Aspect | Details |
|---|---|
| Released | April 2023 |
| Developer | Meta AI Research (FAIR) |
| Dataset | SA-1B — 1.1 billion masks on 11 million images |
| Architecture | Image encoder (ViT-H) + prompt encoder + mask decoder |
| Key capability | Zero-shot segmentation of any object from points, boxes, or rough masks |
| Follow-up | SAM 2 (2024) extended to video segmentation |
| Significance | Foundation model paradigm applied to computer vision segmentation |
“SAM is to image segmentation what GPT-3 was to text generation — a demonstration that foundation models can generalize across an entire domain.” — Meta AI Blog
SAM accelerated research in autonomous driving, medical imaging, robotics, augmented reality, and video editing. Its release as open source ensured rapid adoption and inspired dozens of follow-up projects that extended the approach to 3D, video, and domain-specific applications.
LLaMA and Open-Source Language Models (2023)
In February 2023, Meta AI released LLaMA (Large Language Model Meta AI) — a family of language models ranging from 7B to 65B parameters. Unlike GPT-4, LLaMA’s weights were made available to the research community (and soon leaked publicly), igniting an open-source AI revolution.
LLaMA followed Chinchilla-optimal scaling: the 65B model was trained on 1.4 trillion tokens — far more data per parameter than previous models. The result was a 65B model that matched or exceeded the performance of much larger proprietary models on many benchmarks. In July 2023, Meta released Llama 2 with a commercial license, followed by Llama 3 in April 2024 — each iteration narrowing the gap with frontier proprietary models.
| Aspect | Details |
|---|---|
| LLaMA released | February 2023 |
| Llama 2 released | July 2023 (with commercial license) |
| Llama 3 released | April 2024 (8B and 70B), July 2024 (Llama 3.1, 405B) |
| Developer | Meta AI |
| LLaMA sizes | 7B, 13B, 33B, 65B parameters |
| Training data | 1.0T – 1.4T tokens (publicly available data) |
| Key innovation | Chinchilla-optimal training; open weights with commercial license |
| Impact | Spawned thousands of open-source derivatives and fine-tunes |
“Our mission is to open up access to AI so that more people and institutions can explore, research, and benefit from it.” — Meta AI
LLaMA’s release shattered the assumption that only closed-source labs could produce competitive language models. Within weeks, the open-source community produced Alpaca, Vicuna, WizardLM, and hundreds of fine-tuned variants. By 2024, open models regularly competed with proprietary ones on key benchmarks. In early 2025, DeepSeek-R1 from China demonstrated that open reasoning models could match frontier performance, further validating the open-source approach.
```mermaid
graph TD
    A["LLaMA<br/>(Feb 2023)<br/>Research License"] --> B["Llama 2<br/>(Jul 2023)<br/>Commercial License"]
    B --> C["Llama 3<br/>(2024)<br/>8B, 70B, 405B"]
    A --> D["Open-Source<br/>Explosion"]
    D --> E["Alpaca · Vicuna<br/>WizardLM"]
    D --> F["Mistral · Mixtral<br/>Qwen · DeepSeek"]
    C --> G["Narrowing Gap with<br/>Frontier Proprietary<br/>Models"]
    style A fill:#3498db,color:#fff,stroke:#333
    style B fill:#2980b9,color:#fff,stroke:#333
    style C fill:#e74c3c,color:#fff,stroke:#333
    style D fill:#27ae60,color:#fff,stroke:#333
    style E fill:#f39c12,color:#fff,stroke:#333
    style F fill:#8e44ad,color:#fff,stroke:#333
    style G fill:#1a5276,color:#fff,stroke:#333
```
Generative Art and Creative AI (2022–2023)
By 2023, generative AI had become a cultural phenomenon. Midjourney — a text-to-image service accessible via Discord — produced images so stunning that a Midjourney-generated artwork won a prize at the Colorado State Fair art competition in September 2022, igniting fierce debate about the nature of creativity and authorship.
DALL·E 3 (October 2023) was integrated directly into ChatGPT, allowing users to generate and refine images through natural conversation. Stable Diffusion XL (July 2023) introduced native 1024×1024 resolution and dramatically improved image quality. And the open-source community built an ecosystem of tools — ControlNet, LoRA fine-tuning, ComfyUI — that gave artists and developers unprecedented control over the generation process.
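LoRA, one of the community tools mentioned above, adapts a frozen pretrained weight matrix W by training only a small low-rank update BA, so the effective weight becomes W + (α/r)·BA. The sketch below illustrates the idea in NumPy; the shapes and hyperparameters are illustrative, not taken from any specific model.

```python
import numpy as np

d_out, d_in, r, alpha = 512, 512, 8, 16      # illustrative sizes; r is the rank

rng = np.random.default_rng(0)
W = rng.normal(size=(d_out, d_in))           # frozen pretrained weight
A = rng.normal(scale=0.01, size=(r, d_in))   # trainable low-rank factor
B = np.zeros((d_out, r))                     # trainable, zero-initialized

def lora_forward(x):
    # Base path plus low-rank update; with B = 0 the output equals W @ x,
    # so training starts from the pretrained model's behavior.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=(d_in,))
assert np.allclose(lora_forward(x), W @ x)   # identity at initialization

# Trainable parameter count: r*(d_in + d_out) instead of d_in*d_out.
full, lora = d_in * d_out, r * (d_in + d_out)
print(f"full fine-tune params: {full:,}  LoRA params: {lora:,}")
```

The parameter savings (here roughly 30× fewer trainable weights) are what made fine-tuning large diffusion and language models feasible on consumer GPUs.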
| Aspect | Details |
|---|---|
| Midjourney V5 | March 2023 — photorealistic quality |
| DALL·E 3 | October 2023 — integrated into ChatGPT |
| Stable Diffusion XL | July 2023 — 1024×1024, 3.5B parameters |
| Colorado controversy | September 2022 — Midjourney artwork wins art prize |
| Key tools | ControlNet, LoRA, DreamBooth, ComfyUI, AUTOMATIC1111 |
| Impact | Democratized visual creation; challenged traditional art markets |
“AI-generated art is the most disruptive thing to happen to visual culture since the invention of photography.” — Jason Allen, creator of the prize-winning Midjourney entry, 2022
Generative AI raised existential questions for creative professionals. Could AI replace artists, designers, and photographers? Was AI-generated content copyrightable? The legal and cultural debates intensified: the U.S. Copyright Office ruled that purely AI-generated images could not be copyrighted, while courts grappled with whether training on copyrighted data constituted fair use.
Microsoft 365 Copilot: AI in Enterprise (2023)
In March 2023, Microsoft announced Microsoft 365 Copilot, bringing GPT-4-powered AI assistance into Word, Excel, PowerPoint, Outlook, and Teams. This was a watershed moment: for the first time, large language models were embedded directly into the productivity tools used by hundreds of millions of knowledge workers worldwide.
Copilot could draft documents, summarize email threads, generate presentations from outlines, analyze spreadsheets with natural language queries, and take meeting notes in Teams. It demonstrated that LLMs could augment — rather than replace — human knowledge work.
| Aspect | Details |
|---|---|
| Announced | March 2023 |
| General availability | November 2023 |
| Powered by | GPT-4, Microsoft Graph (user context) |
| Applications | Word, Excel, PowerPoint, Outlook, Teams |
| Pricing | $30/user/month (enterprise) |
| Significance | LLMs embedded in enterprise productivity at planetary scale |
“Copilot is not just a better autocomplete — it’s a new way of working, where AI and human intelligence amplify each other.” — Satya Nadella, CEO of Microsoft
The launch of Microsoft 365 Copilot, alongside Google’s Duet AI for Workspace (later Gemini for Google Workspace), marked the beginning of the AI-augmented workplace. By 2024, AI assistance in documents, code, email, and data analysis was rapidly becoming the default expectation in enterprise environments.
Multimodal AI and GPT-4o (2024)
On May 13, 2024, OpenAI released GPT-4o (“o” for “omni”) — a model natively designed to process and generate text, audio, and images in a unified architecture. GPT-4o could engage in real-time voice conversations with human-like latency (~320 ms average response time), analyze images, and generate both text and audio outputs.
GPT-4o represented a fundamental shift from the chatbot paradigm to a multimodal assistant paradigm. It could understand tone, emotion, and context in voice conversations; describe and analyze visual scenes; and seamlessly switch between modalities — all at substantially faster speed and lower cost than GPT-4.
| Aspect | Details |
|---|---|
| Released | May 13, 2024 |
| Developer | OpenAI |
| Modalities | Text + audio + image (input and output) |
| Response latency | ~320 ms (comparable to human conversation) |
| Follow-ups | GPT-4o mini (Jul 2024, cost-optimized) |
| Key advance | Natively multimodal — not separate models stitched together |
| Significance | Shifted AI from text chatbot to omni-modal real-time assistant |
“The technology is moving so fast — GPT-4o is the kind of AI interaction we used to see only in science fiction movies.” — tech reviewer reaction, May 2024
Google’s Gemini models (2023–2024) pursued a parallel path toward natively multimodal architecture. By late 2024, the expectation for frontier AI systems was that they would be multimodal by default — understanding and generating across text, image, audio, and increasingly video.
On-Device AI and Model Efficiency (2024)
In 2024, the AI industry underwent a paradigm shift toward running capable language models directly on consumer devices — smartphones, laptops, and edge hardware — rather than relying solely on cloud APIs. Apple Intelligence brought on-device models to iPhones and Macs; Google embedded Gemini Nano into Pixel phones; and Qualcomm, Intel, and AMD shipped dedicated neural processing units (NPUs) optimized for transformer inference.
This shift was enabled by years of research into model compression techniques: quantization (reducing precision from 32-bit to 4-bit or lower), distillation (training smaller models to mimic larger ones), pruning, and efficient architectures like mixture-of-experts.
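To make the quantization idea concrete: symmetric 4-bit quantization maps each float weight to one of 16 integer levels around zero, storing only the small integer codes plus a shared scale. This is a toy sketch of the principle, not the actual GPTQ or AWQ algorithms, which add calibration and per-group scales.

```python
import numpy as np

def quantize_int4(w):
    # Symmetric int4 uses the range -7..7; one float scale covers the tensor.
    scale = np.max(np.abs(w)) / 7.0
    q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Inference reconstructs approximate float weights on the fly.
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4)).astype(np.float32)
q, scale = quantize_int4(w)
w_hat = dequantize(q, scale)

# Rounding error is bounded by half a quantization step.
assert np.max(np.abs(w - w_hat)) <= scale / 2 + 1e-6
```

Going from 32-bit floats to 4-bit codes shrinks weight storage roughly 8×, which is what lets multi-billion-parameter models fit in a phone's memory budget.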
| Aspect | Details |
|---|---|
| Apple Intelligence | Announced June 2024, iOS 18 / macOS Sequoia |
| Gemini Nano | Deployed on Pixel devices for on-device chat and summarization |
| Hardware | NPUs in Qualcomm Snapdragon, Intel Core Ultra, Apple M-series |
| Techniques | Quantization (GPTQ, GGML, AWQ), distillation, pruning, LoRA adapters |
| Models | Phi-3 (Microsoft), Gemma (Google), Llama 3.2 (Meta) — designed for on-device |
| Significance | AI inference without cloud dependency; privacy-preserving, low-latency |
“On-device AI means your AI assistant works even without an internet connection — and your data never leaves your phone.” — Apple, WWDC 2024
On-device AI addressed critical concerns about privacy, latency, and cost. By running models locally, sensitive data never had to be transmitted to the cloud. And for developers, eliminating per-query API costs fundamentally changed the economics of AI-powered applications.
AI Governance and the EU AI Act (2024)
As AI capabilities accelerated, governments worldwide moved to establish regulatory frameworks. The most significant was the EU AI Act, which entered into force on August 1, 2024 — the world’s first comprehensive legal framework for artificial intelligence.
The EU AI Act classified AI systems into risk tiers: unacceptable (banned — e.g., social scoring, real-time remote biometric identification in public spaces), high-risk (subject to strict requirements), limited risk (transparency obligations), and minimal risk (largely unregulated). General-purpose AI models like GPT-4 and Llama fell under specific provisions requiring transparency and documentation.
| Aspect | Details |
|---|---|
| EU AI Act | Entered into force August 1, 2024 |
| U.S. Executive Order | October 30, 2023 (AI safety and security) |
| China AI regulations | Generative AI rules effective August 2023 |
| G7 Hiroshima AI Process | International voluntary code of conduct |
| Key principles | Risk-based classification, transparency, human oversight |
| Frontier model provisions | Safety testing, red-teaming, incident reporting |
| Significance | First comprehensive AI regulation; global template for governance |
“The AI Act is not about saying ‘no’ to AI — it’s about building trust so that AI can be adopted more widely.” — European Commission
In parallel, AI companies adopted voluntary safety commitments: red-teaming, model evaluations, watermarking of AI-generated content, and responsible disclosure. The 2024 Nobel Prizes in Chemistry (AlphaFold) and Physics (foundational neural-network research by Hopfield and Hinton) highlighted both AI’s transformative potential and the urgency of thoughtful governance.
AI Agents: The Next Frontier (2025)
By 2025, the focus of AI research and deployment shifted decisively toward agentic AI — systems that don’t just respond to queries but plan, reason, use tools, and execute multi-step tasks autonomously. AI agents could browse the web, write and execute code, manage files, call APIs, and chain together complex workflows with minimal human intervention.
OpenAI launched Operator (January 2025) for browser-based task automation, followed by Codex (May 2025) for autonomous software engineering, and a general-purpose ChatGPT Agent (July 2025). Google, Anthropic, and Microsoft deployed their own agent frameworks. The paradigm shifted from “chatbot you talk to” to “assistant that works for you.”
| Aspect | Details |
|---|---|
| OpenAI Operator | January 2025 — autonomous web browsing and task execution |
| OpenAI Codex agent | May 2025 — autonomous software engineering |
| ChatGPT Agent | July 2025 — general-purpose task agent |
| Anthropic Claude | Tool use, computer use, multi-step reasoning capabilities |
| Google Project Mariner | Agent framework for complex task delegation |
| Key capabilities | Planning, tool use, code execution, multi-step reasoning, memory |
| Significance | Shift from conversational AI to autonomous task execution |
“We’re moving from AI that answers questions to AI that actually does things for you.” — Dario Amodei, CEO of Anthropic
The agentic paradigm brought new challenges: reliability (agents could make mistakes that compound over multiple steps), safety (autonomous actions require guardrails), and trust (users needed confidence that agents would act within approved boundaries). But the potential was enormous: AI agents promised to automate entire workflows — from research and data analysis to coding, scheduling, and content creation — fundamentally reshaping knowledge work.
graph TD
A["User Intent<br/>(Natural Language)"] --> B["Planning Module<br/>(Decompose into Steps)"]
B --> C["Tool Selection<br/>(APIs, Code, Browser)"]
C --> D["Execution<br/>(Multi-step Actions)"]
D --> E["Observation<br/>& Reflection"]
E -->|"Iterate"| B
E --> F["Final Result<br/>Delivered to User"]
style A fill:#3498db,color:#fff,stroke:#333
style B fill:#e74c3c,color:#fff,stroke:#333
style C fill:#27ae60,color:#fff,stroke:#333
style D fill:#f39c12,color:#fff,stroke:#333
style E fill:#8e44ad,color:#fff,stroke:#333
style F fill:#1a5276,color:#fff,stroke:#333
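The plan-act-observe loop in the diagram above can be sketched in a few lines. The tools and the fixed step list here are hypothetical stand-ins; in a real agent, an LLM would choose each next step and decide when to stop.

```python
# Minimal plan-act-observe agent loop sketch (illustrative, not a real framework).
def search(query):
    return f"results for {query!r}"          # stand-in for a web-search tool

def calculate(expr):
    return eval(expr, {"__builtins__": {}})  # toy calculator for the sketch

TOOLS = {"search": search, "calculate": calculate}

def run_agent(goal, steps):
    observations = []
    for tool_name, arg in steps:             # "plan": fixed steps in this sketch
        result = TOOLS[tool_name](arg)       # act: invoke the selected tool
        observations.append((tool_name, result))  # observe and remember
    return observations                      # reflection/final answer would follow

log = run_agent("estimate revenue",
                [("search", "widget price"), ("calculate", "19 * 1000")])
assert log[1] == ("calculate", 19000)
```

The reliability and safety challenges mentioned above live inside this loop: each tool call can fail or misfire, and errors feed into the next step's context, which is why production agents add guardrails, approval gates, and retries around it.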
DeepSeek and the Open Reasoning Revolution (2025)
In January 2025, Chinese AI lab DeepSeek released DeepSeek-R1 — an open-weight reasoning model that matched or exceeded the performance of OpenAI’s o1 on mathematics, coding, and scientific reasoning benchmarks. DeepSeek-R1 was notable for its transparency: it explicitly showed its chain-of-thought reasoning process, and its weights were freely available.
DeepSeek’s breakthrough demonstrated that frontier AI capabilities were no longer the exclusive domain of a handful of well-funded Western labs. It also validated the viability of open reasoning models — LLMs that could perform extended multi-step reasoning with full weight availability for the research community.
| Aspect | Details |
|---|---|
| Released | January 2025 |
| Developer | DeepSeek (China) |
| Key models | DeepSeek-V3, DeepSeek-R1 (reasoning) |
| R1 performance | Competitive with OpenAI o1 on math, code, and science |
| Training efficiency | Reportedly trained at significantly lower cost than Western models |
| License | Open weights |
| Significance | Proved open models could match frontier reasoning capabilities |
“DeepSeek-R1 reminded the world that AI innovation is global — and that openness accelerates progress.” — AI researcher reaction, January 2025
DeepSeek’s success intensified the global race in AI development and prompted Western labs to accelerate their own reasoning model efforts. OpenAI responded with o3 (April 2025), and the competition between open and closed reasoning models became one of the defining dynamics of the AI landscape.
Text-to-Video and the Expanding Creative Frontier (2024–2025)
As image generation matured, the frontier moved to video. In February 2024, OpenAI previewed Sora — a diffusion transformer model that could generate realistic minute-long videos from text prompts. Sora produced coherent, cinematic footage with consistent characters, camera movements, and physical interactions that far surpassed previous video generation attempts.
By 2025, multiple labs had released video generation models: Google’s Veo, Runway’s Gen-3, and open-source alternatives. While none matched Hollywood production quality, they represented a paradigm shift — the ability to generate visual narratives from text descriptions, with implications for filmmaking, advertising, education, and entertainment.
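At its core, a diffusion model generates by starting from pure noise and iterating a learned denoising step. The toy sketch below shows the shape of that reverse process; `predict_noise` is a placeholder for the trained network (in Sora's case, a large transformer over spacetime patches of video latents), and all constants are illustrative.

```python
import numpy as np

T = 50                                   # number of diffusion steps (toy value)
betas = np.linspace(1e-4, 0.02, T)       # noise schedule
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def predict_noise(x, t):
    # Placeholder: a trained diffusion transformer would predict the noise here.
    return 0.1 * x

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8))              # latent "frame", initialized as noise
for t in reversed(range(T)):
    eps = predict_noise(x, t)
    # DDPM-style mean update: remove the predicted noise component, rescale.
    x = (x - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alphas[t])
    if t > 0:                            # inject fresh noise except at the last step
        x = x + np.sqrt(betas[t]) * rng.normal(size=x.shape)
```

Video generation runs this loop over 3D (space plus time) latents rather than a single image, which is where the coherence and compute-cost challenges noted in the table below come from.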
| Aspect | Details |
|---|---|
| Sora preview | February 2024 (OpenAI) |
| Sora public release | December 2024 |
| Competitors | Google Veo, Runway Gen-3, Pika, Kling |
| Capabilities | Minute-long coherent video from text prompts |
| Technical approach | Diffusion transformer trained on video data |
| Challenges | Physics consistency, long-form coherence, compute cost |
| Significance | Extended generative AI from static images to temporal narratives |
“The leap from image generation to video generation is like going from photography to cinema — it opens up entirely new forms of expression.”
Text-to-video, alongside advances in 3D generation and world models, pointed toward a future where AI could generate entire virtual environments, interactive experiences, and personalized media on demand.
The 2020–2025 Transformation at a Glance
The half-decade from 2020 to 2025 transformed AI from a powerful but specialized technology into a general-purpose tool reshaping civilization. The speed and breadth of change were without precedent:
| Dimension | 2019 State | 2025 State |
|---|---|---|
| Frontier language models | GPT-2 (1.5B parameters) | GPT-5, Gemini, Claude, Llama (100B–1T+ parameters) |
| Chat AI users | Virtually none | Hundreds of millions weekly |
| Image generation | Research demos | Billions of images generated; integrated into consumer apps |
| Video generation | Primitive | Minute-long coherent videos from text |
| Code generation | Auto-complete | Autonomous coding agents |
| AI in science | Promising early results | Nobel Prize–winning breakthroughs (AlphaFold) |
| AI regulation | Minimal | EU AI Act, executive orders, international frameworks |
| On-device AI | None | Full language models running on smartphones |
| AI agents | Concept/research | Deployed agent products (Operator, Copilot, Codex) |
| Industry investment | Billions annually | Hundreds of billions annually |
By 2025, AI was no longer a future technology. It was the present — woven into search, creativity, productivity, science, governance, and daily life. And the pace showed no signs of slowing.
Video: 2020–2025 AI Milestones — ChatGPT Moments, Generative AI & Agentic AI
Please subscribe to the Vectoring AI YouTube channel for more video tutorials 🚀
References
- Brown, T. et al. “Language Models Are Few-Shot Learners.” Advances in Neural Information Processing Systems 33 (2020). arXiv:2005.14165
- Dosovitskiy, A. et al. “An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale.” ICLR (2021). arXiv:2010.11929
- Jumper, J. et al. “Highly Accurate Protein Structure Prediction with AlphaFold.” Nature 596, 583–589 (2021).
- Radford, A. et al. “Learning Transferable Visual Models From Natural Language Supervision.” ICML (2021). arXiv:2103.00020
- Ouyang, L. et al. “Training Language Models to Follow Instructions with Human Feedback.” NeurIPS 35 (2022). arXiv:2203.02155
- Hoffmann, J. et al. “Training Compute-Optimal Large Language Models.” NeurIPS 35 (2022). arXiv:2203.15556
- Rombach, R. et al. “High-Resolution Image Synthesis with Latent Diffusion Models.” CVPR (2022). arXiv:2112.10752
- Wei, J. et al. “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models.” NeurIPS 35 (2022). arXiv:2201.11903
- OpenAI. “GPT-4 Technical Report.” arXiv:2303.08774 (2023).
- Kirillov, A. et al. “Segment Anything.” ICCV (2023). arXiv:2304.02643
- Touvron, H. et al. “LLaMA: Open and Efficient Foundation Language Models.” arXiv:2302.13971 (2023).
- Touvron, H. et al. “Llama 2: Open Foundation and Fine-Tuned Chat Models.” arXiv:2307.09288 (2023).
- Abramson, J. et al. “Accurate Structure Prediction of Biomolecular Interactions with AlphaFold 3.” Nature 630, 493–500 (2024).
- Wikipedia. “ChatGPT.” en.wikipedia.org/wiki/ChatGPT
- Wikipedia. “AlphaFold.” en.wikipedia.org/wiki/AlphaFold
- Wikipedia. “Stable Diffusion.” en.wikipedia.org/wiki/Stable_Diffusion
Read More
- See the decade of deep learning that preceded the generative AI era — 2010s AI Milestones
- The infrastructure decade that enabled modern AI — 2000s AI Milestones
- The data revolution and statistical learning — 1990s AI Milestones
- From expert systems to the second AI winter — 1980s AI Milestones
- The first AI winter and the seeds of recovery — 1970s AI Milestones
- Where it all began — 1950s–1960s AI Milestones
- How transformers power modern language models — Pre-training LLMs from Scratch
- Modern methods for aligning LLMs — Post-Training LLMs for Human Alignment
- From prompts to context — Prompt Engineering vs Context Engineering
- Scaling inference for production — Scaling LLM Serving for Enterprise Production